📊 A Data-Driven Dive into Life Expectancy Worldwide: An Actuarial Exploration Analyzing Determinants of Longevity#

Author: Hamid Saebi Monfared

Course Project, UC Irvine, Math 10, Summer 2023

🚀 Introduction#

In an era where technological advancements and medical breakthroughs occur at an unprecedented pace, understanding the factors influencing life expectancy worldwide has never been more critical. This data-driven exploration delves deep into global life expectancy determinants, combining actuarial insight with data analysis. My journey aims to unravel the complex interplay of variables that shape longevity and, in doing so, shed light on the secrets of a long life.

📑 About the Dataset#

The dataset originates from a CSV file on Kaggle (Life Expectancy (WHO)); the following provides an overview of its context and structure. This dataset stems from a comprehensive study examining factors influencing life expectancy. While past studies largely focused on demographic variables, income, and mortality rates, this dataset incorporates additional aspects like immunization effects and the Human Development Index, using data spanning 2000 to 2015 for various countries. Special attention is given to significant immunizations, including Hepatitis B, Polio, and Diphtheria. Sourced from the World Health Organization’s Global Health Observatory, the data comprises life expectancy and health metrics for 193 countries, alongside corresponding economic data from the United Nations. The data underwent meticulous cleaning, especially concerning missing values, primarily from lesser-known countries. Consequently, some nations, like Vanuatu and Tonga, were excluded, resulting in a final dataset of 22 columns and 2,938 rows. The predictor variables have been categorized into broad categories like immunization-related factors, mortality influences, economic elements, and social determinants.

🛠️ Data Processing#

Data processing involves the transformation of raw data into a structured format suitable for analysis. In this stage, I perform various tasks such as cleaning the data, handling missing values, and ensuring the data is consistent and reliable. This step is vital to guarantee the accuracy and integrity of my subsequent analyses, ensuring that my insights are based on trustworthy and meaningful information.

🧐 Initial Data Inspection#

Before diving into data cleaning, I’ll import and take a preliminary look at the dataset. This allows me to understand its structure, identify potential issues, and plan the subsequent cleaning steps more effectively.

import pandas as pd
df = pd.read_csv("Life Expectancy Data.csv")
df.head()
Country Year Status Life expectancy Adult Mortality infant deaths Alcohol percentage expenditure Hepatitis B Measles ... Polio Total expenditure Diphtheria HIV/AIDS GDP Population thinness 1-19 years thinness 5-9 years Income composition of resources Schooling
0 Afghanistan 2015 Developing 65.0 263.0 62 0.01 71.279624 65.0 1154 ... 6.0 8.16 65.0 0.1 584.259210 33736494.0 17.2 17.3 0.479 10.1
1 Afghanistan 2014 Developing 59.9 271.0 64 0.01 73.523582 62.0 492 ... 58.0 8.18 62.0 0.1 612.696514 327582.0 17.5 17.5 0.476 10.0
2 Afghanistan 2013 Developing 59.9 268.0 66 0.01 73.219243 64.0 430 ... 62.0 8.13 64.0 0.1 631.744976 31731688.0 17.7 17.7 0.470 9.9
3 Afghanistan 2012 Developing 59.5 272.0 69 0.01 78.184215 67.0 2787 ... 67.0 8.52 67.0 0.1 669.959000 3696958.0 17.9 18.0 0.463 9.8
4 Afghanistan 2011 Developing 59.2 275.0 71 0.01 7.097109 68.0 3013 ... 68.0 7.87 68.0 0.1 63.537231 2978599.0 18.2 18.2 0.454 9.5

5 rows × 22 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10   BMI                             2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio                            2919 non-null   float64
 13  Total expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15   HIV/AIDS                        2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18   thinness  1-19 years            2904 non-null   float64
 19   thinness 5-9 years              2904 non-null   float64
 20  Income composition of resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB

🗂️ Dataset Overview#

Now, I introduce the dataset that I will be using for this project. The data comprises various health metrics for different countries, collected over several years. Below, I describe each field in the dataset:

df.columns
Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')

| Field | Description |
| --- | --- |
| Country | Country |
| Year | Year |
| Status | Developed or Developing status |
| Life expectancy | Life expectancy in years |
| Adult Mortality | Adult mortality rate of both sexes (probability of dying between 15 and 60 years per 1,000 population) |
| infant deaths | Number of infant deaths per 1,000 population |
| Alcohol | Recorded per capita (15+) alcohol consumption (in litres of pure alcohol) |
| percentage expenditure | Expenditure on health as a percentage of GDP per capita (%) |
| Hepatitis B | Hepatitis B (HepB) immunization coverage among 1-year-olds (%) |
| Measles | Number of reported measles cases per 1,000 population |
| BMI | Average Body Mass Index of the entire population |
| under-five deaths | Number of under-five deaths per 1,000 population |
| Polio | Polio (Pol3) immunization coverage among 1-year-olds (%) |
| Total expenditure | General government expenditure on health as a percentage of total government expenditure (%) |
| Diphtheria | Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%) |
| HIV/AIDS | Deaths per 1,000 live births due to HIV/AIDS (0-4 years) |
| GDP | Gross Domestic Product per capita (in USD) |
| Population | Population of the country |
| thinness 1-19 years | Prevalence of thinness among children and adolescents aged 10 to 19 (%) |
| thinness 5-9 years | Prevalence of thinness among children aged 5 to 9 (%) |
| Income composition of resources | Human Development Index in terms of income composition of resources (index from 0 to 1) |
| Schooling | Number of years of schooling |

df.Country.unique()
array(['Afghanistan', 'Albania', 'Algeria', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Brunei Darussalam', 'Bulgaria',
       'Burkina Faso', 'Burundi', "Côte d'Ivoire", 'Cabo Verde',
       'Cambodia', 'Cameroon', 'Canada', 'Central African Republic',
       'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo',
       'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Cyprus',
       'Czechia', "Democratic People's Republic of Korea",
       'Democratic Republic of the Congo', 'Denmark', 'Djibouti',
       'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon', 'Gambia',
       'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala',
       'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras',
       'Hungary', 'Iceland', 'India', 'Indonesia',
       'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati',
       'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic",
       'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Lithuania',
       'Luxembourg', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives',
       'Mali', 'Malta', 'Marshall Islands', 'Mauritania', 'Mauritius',
       'Mexico', 'Micronesia (Federated States of)', 'Monaco', 'Mongolia',
       'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia',
       'Nauru', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua',
       'Niger', 'Nigeria', 'Niue', 'Norway', 'Oman', 'Pakistan', 'Palau',
       'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines',
       'Poland', 'Portugal', 'Qatar', 'Republic of Korea',
       'Republic of Moldova', 'Romania', 'Russian Federation', 'Rwanda',
       'Saint Kitts and Nevis', 'Saint Lucia',
       'Saint Vincent and the Grenadines', 'Samoa', 'San Marino',
       'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
       'Seychelles', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia',
       'Solomon Islands', 'Somalia', 'South Africa', 'South Sudan',
       'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Swaziland', 'Sweden',
       'Switzerland', 'Syrian Arab Republic', 'Tajikistan', 'Thailand',
       'The former Yugoslav republic of Macedonia', 'Timor-Leste', 'Togo',
       'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey',
       'Turkmenistan', 'Tuvalu', 'Uganda', 'Ukraine',
       'United Arab Emirates',
       'United Kingdom of Great Britain and Northern Ireland',
       'United Republic of Tanzania', 'United States of America',
       'Uruguay', 'Uzbekistan', 'Vanuatu',
       'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Yemen',
       'Zambia', 'Zimbabwe'], dtype=object)

I use the transpose() method so the entire summary table fits in a single view.

df.describe().transpose()
count mean std min 25% 50% 75% max
Year 2938.0 2.007519e+03 4.613841e+00 2000.00000 2004.000000 2.008000e+03 2.012000e+03 2.015000e+03
Life expectancy 2928.0 6.922493e+01 9.523867e+00 36.30000 63.100000 7.210000e+01 7.570000e+01 8.900000e+01
Adult Mortality 2928.0 1.647964e+02 1.242921e+02 1.00000 74.000000 1.440000e+02 2.280000e+02 7.230000e+02
infant deaths 2938.0 3.030395e+01 1.179265e+02 0.00000 0.000000 3.000000e+00 2.200000e+01 1.800000e+03
Alcohol 2744.0 4.602861e+00 4.052413e+00 0.01000 0.877500 3.755000e+00 7.702500e+00 1.787000e+01
percentage expenditure 2938.0 7.382513e+02 1.987915e+03 0.00000 4.685343 6.491291e+01 4.415341e+02 1.947991e+04
Hepatitis B 2385.0 8.094046e+01 2.507002e+01 1.00000 77.000000 9.200000e+01 9.700000e+01 9.900000e+01
Measles 2938.0 2.419592e+03 1.146727e+04 0.00000 0.000000 1.700000e+01 3.602500e+02 2.121830e+05
BMI 2904.0 3.832125e+01 2.004403e+01 1.00000 19.300000 4.350000e+01 5.620000e+01 8.730000e+01
under-five deaths 2938.0 4.203574e+01 1.604455e+02 0.00000 0.000000 4.000000e+00 2.800000e+01 2.500000e+03
Polio 2919.0 8.255019e+01 2.342805e+01 3.00000 78.000000 9.300000e+01 9.700000e+01 9.900000e+01
Total expenditure 2712.0 5.938190e+00 2.498320e+00 0.37000 4.260000 5.755000e+00 7.492500e+00 1.760000e+01
Diphtheria 2919.0 8.232408e+01 2.371691e+01 2.00000 78.000000 9.300000e+01 9.700000e+01 9.900000e+01
HIV/AIDS 2938.0 1.742103e+00 5.077785e+00 0.10000 0.100000 1.000000e-01 8.000000e-01 5.060000e+01
GDP 2490.0 7.483158e+03 1.427017e+04 1.68135 463.935626 1.766948e+03 5.910806e+03 1.191727e+05
Population 2286.0 1.275338e+07 6.101210e+07 34.00000 195793.250000 1.386542e+06 7.420359e+06 1.293859e+09
thinness 1-19 years 2904.0 4.839704e+00 4.420195e+00 0.10000 1.600000 3.300000e+00 7.200000e+00 2.770000e+01
thinness 5-9 years 2904.0 4.870317e+00 4.508882e+00 0.10000 1.500000 3.300000e+00 7.200000e+00 2.860000e+01
Income composition of resources 2771.0 6.275511e-01 2.109036e-01 0.00000 0.493000 6.770000e-01 7.790000e-01 9.480000e-01
Schooling 2775.0 1.199279e+01 3.358920e+00 0.00000 10.100000 1.230000e+01 1.430000e+01 2.070000e+01

🧹 Data Cleaning#

In this phase, I identify and rectify errors and inconsistencies in the data to enhance its quality and reliability. My aim is to ensure the dataset is accurate, consistent, and primed for analysis.
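Besides missing values, a quick scan for exact duplicate rows is a cheap consistency check. The sketch below is illustrative only, run on a toy frame rather than the project’s `df`:

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative only)
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2000, 2000, 2001],
    "Life expectancy": [70.0, 70.0, 68.5],
})

# Count exact duplicate rows, then drop them
n_duplicates = toy.duplicated().sum()
deduped = toy.drop_duplicates()
print(n_duplicates, len(deduped))  # 1 duplicate found; 2 rows remain
```

On the real data, the same two calls, `df.duplicated().sum()` and `df.drop_duplicates()`, apply unchanged.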

missing_data = df.isnull().sum()
print(missing_data)
Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
 BMI                                34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
 HIV/AIDS                            0
GDP                                448
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64

Let’s calculate and highlight the percentage of missing data in each feature of the DataFrame, making it easier for us to grasp the extent of data incompleteness, a vital step in data quality assessment and preprocessing.

def checkna(df):
    missing_values = df.isna().sum().reset_index()
    missing_values.columns = ["Feature", "Missing_Values"]
    # Round to two decimals so small-but-nonzero gaps do not display as 0.0%
    missing_values["Missing Values Percentage"] = round(missing_values.Missing_Values/len(df)*100, 2).astype(str) + '%'
    return missing_values[missing_values.Missing_Values > 0]
checkna(df)
Feature Missing_Values Missing Values Percentage
3 Life expectancy 10 0.34%
4 Adult Mortality 10 0.34%
6 Alcohol 194 6.6%
8 Hepatitis B 553 18.82%
10 BMI 34 1.16%
12 Polio 19 0.65%
13 Total expenditure 226 7.69%
14 Diphtheria 19 0.65%
16 GDP 448 15.25%
17 Population 652 22.19%
18 thinness 1-19 years 34 1.16%
19 thinness 5-9 years 34 1.16%
20 Income composition of resources 167 5.68%
21 Schooling 163 5.55%

Important: In the preliminary stages of data cleaning, a straightforward approach would be to discard rows with missing values, a strategy often adopted in general practice and in our assignments. However, in this project, I am keen to venture beyond the basics and employ more sophisticated strategies to handle missing values, both to enhance the richness of our dataset and to explore actuarial concepts in greater depth. Thus, I embark on a detailed process, utilizing various methods to manage missing values judiciously and to enrich this project with more advanced analytical techniques.
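Since the pattern “check skewness, then pick mean or median” recurs throughout this cleaning stage, it can be captured in a small helper. This is a hypothetical sketch (the name `impute_by_skew` and the 0.5 skewness threshold are my own illustrative choices, not part of the original analysis):

```python
import pandas as pd

def impute_by_skew(df: pd.DataFrame, column: str, threshold: float = 0.5) -> pd.DataFrame:
    """Fill missing values with the median when |skewness| exceeds the
    threshold (the median resists outliers), otherwise with the mean."""
    fill = df[column].median() if abs(df[column].skew()) > threshold else df[column].mean()
    df[column] = df[column].fillna(fill)
    return df

# Toy example: a strong right skew, so the median (2.0) is used as the fill
toy = pd.DataFrame({"x": [1.0, 2.0, 2.0, 3.0, 50.0, None]})
toy = impute_by_skew(toy, "x")
print(toy["x"].iloc[5])  # 2.0
```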

# Several column names carry stray leading/trailing spaces (e.g. 'Life expectancy '),
# which caused a KeyError when I first accessed them by name. Stripping the
# whitespace normalizes all column names in one pass.
df.columns = [col.strip() for col in df.columns]
print(df.columns)
Index(['Country', 'Year', 'Status', 'Life expectancy', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles', 'BMI', 'under-five deaths', 'Polio', 'Total expenditure',
       'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness  1-19 years',
       'thinness 5-9 years', 'Income composition of resources', 'Schooling'],
      dtype='object')

As I explained above, in the process of data cleaning, a critical step is handling missing values appropriately. In this dataset, the ‘Life expectancy’ and ‘Adult Mortality’ columns stand as pivotal pillars for our analysis. Noticing a very small fraction of missing values in these columns, I opt to drop these rows instead of filling them with a central tendency measure such as the mean. This approach is grounded in maintaining the integrity and true distribution of this critical data, avoiding the introduction of bias that might stem from artificially altering the dataset with imputed values. This ensures that our subsequent analyses are based on reliable and authentic data, hence providing a strong foundation for our data-driven insights.

df.dropna(subset=['Life expectancy', 'Adult Mortality'], inplace=True)

For columns with a relatively smaller proportion of missing values — ‘Polio’, ‘Diphtheria’, ‘BMI’, ‘thinness 1-19 years’, and ‘thinness 5-9 years’ — I opt to impute the missing values with the mean value of the respective columns. This decision is anchored in a bid to retain as much data as possible, preventing the loss of valuable information that would occur if we were to eliminate these rows altogether. Imputing with the mean value, in this case, aids in maintaining the overall distribution of the dataset, thus providing a more robust ground for analysis. Moreover, considering that these columns have a smaller number of missing values, using the mean value for imputation introduces a minimal bias, allowing us to proceed with our analysis with a dataset that is both rich in information and statistically sound.

for column in ['Polio', 'Diphtheria', 'BMI', 'thinness  1-19 years', 'thinness 5-9 years']:
    # Assign back rather than using inplace=True on a column slice,
    # which is deprecated under pandas copy-on-write
    df[column] = df[column].fillna(df[column].mean())
df.isnull().sum()
Country                              0
Year                                 0
Status                               0
Life expectancy                      0
Adult Mortality                      0
infant deaths                        0
Alcohol                            193
percentage expenditure               0
Hepatitis B                        553
Measles                              0
BMI                                  0
under-five deaths                    0
Polio                                0
Total expenditure                  226
Diphtheria                           0
HIV/AIDS                             0
GDP                                443
Population                         644
thinness  1-19 years                 0
thinness 5-9 years                   0
Income composition of resources    160
Schooling                          160
dtype: int64

Now let’s look at other columns with higher proportion of missing data:

Alcohol (7% missing values)

Considering this column has a moderate amount of missing data, we could impute the missing values with the mean or median. To check whether the distribution of the ‘Alcohol’ column is skewed, we generally plot a histogram or compute a skewness statistic. For clarity, I provide both:

import altair as alt

alt.Chart(df).mark_bar().encode(
    alt.X('Alcohol:Q', bin=alt.Bin(maxbins=30), title='Alcohol Consumption'),
    alt.Y('count():Q', title='Frequency')
).properties(
    title='Alcohol Distribution'
).configure_axisY(grid=True)
alcohol_skewness = df['Alcohol'].skew()
print(f"Skewness of the 'Alcohol' column: {alcohol_skewness}")
Skewness of the 'Alcohol' column: 0.5872759823338338

As we can see above, the skewness value of approximately 0.587 indicates a moderate right skew. In the presence of a skew, it is generally better to use the median to impute missing values, as it is less sensitive to extreme values compared to the mean. Thus, I fill the missing values in this column with the column’s median value.
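The reasoning above is easy to see on toy numbers: a single extreme value drags the mean far from the bulk of the data while leaving the median untouched (illustrative values, not from the dataset):

```python
import pandas as pd

# One outlier (100.0) among otherwise small values
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
print(s.mean())    # 22.0, pulled up by the outlier
print(s.median())  # 3.0, unaffected
```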

df['Alcohol'] = df['Alcohol'].fillna(df['Alcohol'].median())

Hepatitis B (19% missing values)

The ‘Hepatitis B’ column contains a significant amount of missing values, constituting about 18.82% of the data. Imputing this many missing values using the mean or median could introduce substantial bias and reduce the variance of the dataset, potentially leading to incorrect analyses. Therefore, I decided to use the K-Nearest Neighbors (KNN) imputation strategy, which can maintain the underlying data distribution, offering a more accurate imputation compared to using the median or mean. This method leverages information from the ‘k’ most similar or ‘nearest’ observations to fill in missing values, preserving the dataset’s structure more effectively and dynamically incorporating the inherent patterns and distributions in the data.

How KNN Works

KNN imputation identifies the ‘k’ most similar observations to the one with the missing value, based on other features, and then imputes the missing value based on the average (or other metric) of these ‘k’ observations. The similarity between observations is typically calculated using distance metrics such as Euclidean distance. It provides a more nuanced approach to imputation by considering the multidimensional nature of the data.
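A minimal illustration on toy numbers (assumed values, not from the dataset): `KNNImputer` fills the missing entry with the average of the two nearest rows, where distance is measured on the observed feature:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 10.0],
    [2.0, 12.0],
    [3.0, np.nan],   # missing second feature
    [8.0, 40.0],
])

# The two nearest rows to [3.0, nan] by the first feature are
# [2.0, 12.0] and [1.0, 10.0], so the fill is (12.0 + 10.0) / 2
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed[2, 1])  # 11.0
```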

from sklearn.impute import KNNImputer

# Use the related immunization columns alongside 'Hepatitis B' so that KNN
# has observed features to measure row similarity on; with a single column,
# rows missing that column have nothing left to compare, and the imputation
# degenerates to a simple mean fill.
immunization_cols = ['Hepatitis B', 'Polio', 'Diphtheria']

imputer = KNNImputer(n_neighbors=5)
imputed = imputer.fit_transform(df[immunization_cols])

df['Hepatitis B'] = imputed[:, 0]

Total expenditure (8% missing values)

Before deciding on an imputation strategy for the ‘Total expenditure’ column, it’s imperative to understand the distribution of the data present in this column. Hence, I will determine the skewness of the distribution and visualize it through a histogram. By using the central tendencies (mean or median) for imputation, I am essentially substituting missing values with the most representative value of the column, maintaining the original data’s characteristics to a large extent, and providing a balanced approach without introducing additional complexities or potential sources of bias, unlike more sophisticated imputation methods such as KNN.

alt.Chart(df).mark_bar().encode(
    alt.X('Total expenditure', bin=alt.Bin(maxbins=30), title='Total Expenditure'),
    alt.Y('count()', title='Frequency')
).properties(
    title='Distribution of Total Expenditure',
    width=400,
    height=300
).configure_axisY(grid=True)
total_expenditure_skewness = df['Total expenditure'].skew()
print(f"Skewness of the 'Total expenditure' column: {total_expenditure_skewness}")
Skewness of the 'Total expenditure' column: 0.5772333235968542

Given that the ‘Total expenditure’ column has a moderate skew (0.577), it would indeed be more prudent to use the median to impute the missing values, as it will be less affected by any extreme values, preserving the underlying distribution of the data more faithfully than the mean would.

df['Total expenditure'] = df['Total expenditure'].fillna(df['Total expenditure'].median())

GDP (15% missing values)

Given the economic parameter we are examining here, the most accurate way to address the missing values would ideally be to individually find the most reliable data to fill these gaps. However, due to time constraints and the challenging nature of accessing reliable data for a range of countries over numerous years, I opt for a statistical method to impute these missing values. It is a general observation that the GDP feature tends to be positively skewed. High-income countries can have very high GDPs compared to the rest, pulling the average up, hence it might not accurately represent the central tendency of the dataset. In such scenarios where we anticipate a significant skewness, it is prudent first to check the skewness of the distribution. Depending on the result, we would choose the most appropriate measure of central tendency (mean or median) to impute the missing values to retain the original data distribution as much as possible.

alt.Chart(df).mark_bar().encode(
    alt.X("GDP:Q", bin=alt.Bin(maxbins=30), title='GDP'),
    alt.Y("count():Q", title='Frequency')
).properties(
    width=400,
    height=300,
    title='Distribution of GDP'
).configure_axisY(grid=True)
gdp_skewness = df['GDP'].skew()
print(f"Skewness of the 'GDP' column: {gdp_skewness}")
Skewness of the 'GDP' column: 3.202781401465919

The skewness value of approximately 3.20 indicates a significantly right-skewed distribution. This skewness tells us that there are countries with GDP values considerably higher than others, pulling the mean upwards. In light of this, utilizing the median would be a more representative approach to impute the missing values, as it is less affected by extremely high or low values. We should proceed by filling the missing values in the ‘GDP’ column with the median of the existing values to maintain the data’s original distribution as closely as possible.

df['GDP'] = df['GDP'].fillna(df['GDP'].median())

Population (22% missing values)

Given the demographic parameter we are analyzing here, the most accurate approach to address the missing values would be to fill in these gaps with the most reliable individual data available. However, considering the time and effort this would require, I opt for a statistical method to handle the missing data in the “Population” column. By employing the KNN imputation method, we retain the underlying data distribution, which generally results in more accurate imputations than using a singular statistic such as the mean or median. This choice is grounded in the desire to conserve the authentic data distribution while accommodating our practical constraints.

from sklearn.impute import KNNImputer

# Pair 'Population' with another numeric feature so KNN has an observed
# feature to measure row similarity on; a lone column would reduce the
# imputation to a simple mean fill.
population_cols = ['Population', 'GDP']

imputer = KNNImputer(n_neighbors=5)
imputed = imputer.fit_transform(df[population_cols])

df['Population'] = imputed[:, 0]

Income composition of resources (6% missing values)

Since this column has fewer missing values than most others, imputing with the mean or median could work. Hence, we first need to understand the distribution of the data in this column.

histogram = alt.Chart(df).mark_bar().encode(
    alt.X('Income composition of resources', bin=alt.Bin(maxbins=30)),
    alt.Y('count()')
).properties(
    title='Income composition of resources Distribution',
    width=400,
    height=300
)

histogram = histogram.configure_axisY(grid=True).configure_axisX(grid=False)
histogram = histogram.configure_axis(
    titleFontSize=12,
    labelFontSize=10
)

histogram
icr_skewness = df['Income composition of resources'].skew()
print(f"Skewness of the 'Income composition of resources' column: {icr_skewness}")
Skewness of the 'Income composition of resources' column: -1.142141952852178

Given the negative skewness value of approximately -1.142, the distribution is moderately skewed to the left. Generally, when we encounter a skewed distribution, it is preferred to use the median for imputing missing values, as it is less sensitive to extreme values compared to the mean.

df['Income composition of resources'] = df['Income composition of resources'].fillna(df['Income composition of resources'].median())

Schooling (6% missing values)

Similarly, considering the small proportion of missing values, imputing with the mean or median is a feasible option. Thus, let’s look at the column’s distribution.

chart = alt.Chart(df).mark_bar().encode(
    alt.X('Schooling:Q', bin=alt.Bin(maxbins=30), axis=alt.Axis(title='Schooling')),
    alt.Y('count():Q', axis=alt.Axis(title='Frequency')),
)

chart = chart.configure_axisY(grid=True)
chart = chart.properties(width=400, title='Distribution of Schooling')

chart
schooling_skewness = df['Schooling'].skew()
print(f"Skewness of the 'Schooling' column: {schooling_skewness}")
Skewness of the 'Schooling' column: -0.5838842686144311

Given that the skewness of the ‘Schooling’ column is approximately -0.584, we can infer a moderate left skew in the distribution. In this scenario, it is more prudent to use the median for imputation, as it is less influenced by extreme values, helping to retain the original distribution of the dataset to a greater extent. Therefore, I have opted to fill the missing values in this column with the median value.

schooling_median = df['Schooling'].median()
df['Schooling'] = df['Schooling'].fillna(schooling_median)
df.isnull().sum()
Country                            0
Year                               0
Status                             0
Life expectancy                    0
Adult Mortality                    0
infant deaths                      0
Alcohol                            0
percentage expenditure             0
Hepatitis B                        0
Measles                            0
BMI                                0
under-five deaths                  0
Polio                              0
Total expenditure                  0
Diphtheria                         0
HIV/AIDS                           0
GDP                                0
Population                         0
thinness  1-19 years               0
thinness 5-9 years                 0
Income composition of resources    0
Schooling                          0
dtype: int64

🕵️ Exploratory Data Analysis (EDA)#

Exploratory Data Analysis, or EDA, is a crucial initial step in the data analysis process, helping to understand the underlying patterns and characteristics present in the data. It allows us to identify relationships between variables, uncover anomalies, and grasp the essential trends within our dataset, laying a solid foundation for creating more accurate predictive models down the line.

🌏 Country Classification based on Development Status#

Before diving into the detailed analysis, it is prudent to understand the distribution of countries in our dataset based on their development status. This demarcation will aid us in discerning any disparities and tailoring our analysis to account for these differences.

Developed Countries: 32
Developing Countries: 151

Further, a quick analysis of the average life expectancy in these groups reveals a substantial gap:

Developed Countries: 79.2 years
Developing Countries: 67.11 years

This glaring difference in life expectancy based on development status sets a potent premise for our ensuing analyses, where we aspire to uncover patterns and infer conclusions about life expectancy in relation to a range of parameters, all the while considering the development status of the countries involved.

countries = df.loc[:, ['Country', 'Status']]
distinct_countries = countries.drop_duplicates(['Country'])
grouped_by_status = distinct_countries.groupby('Status').count()
grouped_by_status
Country
Status
Developed 32
Developing 151
round(df[['Status','Life expectancy']].groupby(['Status']).mean(),2)
Life expectancy
Status
Developed 79.20
Developing 67.11
data = pd.DataFrame({
    'Status': ['Developed', 'Developing'],
    'Mean Life Expectancy': [79.2, 67.11]
})

chart1 = alt.Chart(data).mark_bar().encode(
    x='Status:N',
    y='Mean Life Expectancy:Q',
    color=alt.Color('Status:N', scale=alt.Scale(domain=['Developed', 'Developing'], range=['#28d2c2', '#f075c2']), legend=None)
).properties(
    width=200,
    title='Mean Life Expectancy by Status'
)

chart2 = alt.Chart(df).mark_bar().encode(
    alt.X('Life expectancy', bin=alt.Bin(maxbins=30)), 
    y='count()',
).properties(
    title='Histogram of Life Expectancy'
)
chart1 | chart2
print("Top 10 Countries with Most Life Expectancy")
df.groupby("Country").agg({
    "Life expectancy":"mean"
}).reset_index().sort_values("Life expectancy", ascending = False).head(10)
Top 10 Countries with Most Life Expectancy
Country Life expectancy
82 Japan 82.53750
156 Sweden 82.51875
73 Iceland 82.44375
157 Switzerland 82.33125
58 France 82.21875
80 Italy 82.18750
151 Spain 82.06875
7 Australia 81.81250
119 Norway 81.79375
30 Canada 81.68750

Notable: Despite the substantial impacts of World War II, Japan has rebounded to secure the highest life expectancy globally, closely followed by Sweden, underlining both nations’ remarkable resilience and commitment to public health.

print("Top 10 Countries with Lowest Life Expectancy")
df.groupby("Country").agg({
    "Life expectancy": "mean"
}).reset_index().sort_values("Life expectancy", ascending=True).head(10)
Top 10 Countries with Lowest Life Expectancy
Country Life expectancy
143 Sierra Leone 46.11250
31 Central African Republic 48.51250
92 Lesotho 48.78125
3 Angola 49.01875
98 Malawi 49.89375
32 Chad 50.38750
43 Côte d'Ivoire 50.38750
182 Zimbabwe 50.48750
155 Swaziland 51.32500
118 Nigeria 51.35625

The following dataframe provides a summary of the average values of various health indicators and other features for both developed and developing countries, setting a statistical backdrop for further analysis.

# numeric_only=True restricts the mean to numeric columns (required in pandas >= 2.0)
df_status = pd.DataFrame(df.groupby('Status').mean(numeric_only=True))
df_status.drop(['Year'], inplace=True, axis=1)
df_status.head().transpose()
Status Developed Developing
Life expectancy 7.919785e+01 6.711147e+01
Adult Mortality 7.968555e+01 1.828332e+02
infant deaths 1.494141e+00 3.653477e+01
Alcohol 9.495508e+00 3.513055e+00
percentage expenditure 2.703600e+03 3.242620e+02
Hepatitis B 8.564888e+01 7.996735e+01
Measles 4.990059e+02 2.836619e+03
BMI 5.180391e+01 3.535995e+01
under-five deaths 1.810547e+00 5.073427e+01
Polio 9.373633e+01 8.017733e+01
Total expenditure 7.441289e+00 5.593071e+00
Diphtheria 9.347656e+01 7.995741e+01
HIV/AIDS 1.000000e-01 2.096896e+00
GDP 1.951733e+04 3.895746e+03
Population 7.942770e+06 1.378637e+07
thinness 1-19 years 1.320703e+00 5.598684e+00
thinness 5-9 years 1.296680e+00 5.641103e+00
Income composition of resources 8.360371e-01 5.864917e-01
Schooling 1.551309e+01 1.127496e+01
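The scientific notation in the table above is hard to scan. As a side note, a minimal sketch (using a hypothetical two-variable excerpt of those values, hard-coded here) of how pandas rounding renders such a summary in plain decimals:

```python
import pandas as pd

# Hypothetical two-variable excerpt of the status summary above (hard-coded)
df_status = pd.DataFrame(
    {"Developed": [79.19785, 19517.33], "Developing": [67.11147, 3895.746]},
    index=["Life expectancy", "GDP"],
)

# round() renders the scientific-notation values in plain decimal form
readable = df_status.round(2)
print(readable)
```

Alternatively, pd.set_option('display.float_format', '{:.2f}'.format) changes only the rendering, globally, without altering the underlying values.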

Time Series Plot

In order to understand the trajectory of life expectancy globally, as well as the disparities between developed and developing nations over the years, I have charted a time series plot.

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="darkgrid")

plt.figure(figsize=(10, 6))

palette = {"Developed": "#28d2c2", "Developing": "#f075c2"}

sns.lineplot(x='Year', y='Life expectancy', hue='Status', data=df, palette=palette)

# Add a line for the overall trend
sns.lineplot(x='Year', y='Life expectancy', data=df, color='yellow', label='Overall')

plt.title('Life Expectancy Over the Years\n(From 2000 to 2015)', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Life Expectancy', fontsize=14)

plt.show()
[Figure: time series of life expectancy from 2000 to 2015, by development status, with an overall trend line]

Observations from the Time Series Plot

The plot exhibits a clear upward trend in life expectancy, a testament to advancements in medical science and improved living conditions globally. Differentiating between developed and developing nations, we notice that developed countries have consistently maintained a higher life expectancy. However, it is encouraging to observe that the gap appears to be narrowing in recent years, hinting at improving health conditions globally. The confidence intervals show a consistent rise, albeit with occasional dips, indicating periods where life expectancy may have been affected by various global events.

Boxplot

To further dissect the discrepancies in life expectancy between developed and developing nations, I constructed a boxplot. This visualization succinctly demonstrates the distribution of life expectancy within these groups, providing insights into the median, quartiles, and potential outliers in the data.

sns.set_style("darkgrid")

plt.figure(figsize=(10,6))

sns.boxplot(x='Status', y='Life expectancy', data=df, palette={"Developed": "#28d2c2", "Developing": "#f075c2"})

plt.title('Boxplot of Life Expectancy by Status', fontsize=16)
plt.xlabel('Status', fontsize=14)
plt.ylabel('Life Expectancy', fontsize=14)

plt.show()
[Figure: boxplots of life expectancy for developed vs. developing countries]

Observations from the Boxplot

Upon analysis, it is evident that developed countries exhibit a higher median life expectancy, emphasizing a pronounced advantage in terms of health and longevity. Moreover, the tighter interquartile range for developed nations suggests a more homogeneous distribution of life expectancy, likely owing to more uniformly distributed healthcare amenities and living conditions. In contrast, developing countries show a wider spread, indicating greater disparities within the group, possibly stemming from a mix of nations with varying levels of healthcare infrastructure.

📉 Descriptive Statistics#

Having thoroughly cleaned our dataset, in this section I delve into the summary statistics of each variable. The table below presents the count, mean, standard deviation (SD), minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum values for each column. These metrics capture the central tendency, dispersion, and shape of the dataset's distribution, building a foundational understanding that paves the way for deeper analysis in the subsequent sections.

df.describe().transpose()
count mean std min 25% 50% 75% max
Year 2928.0 2.007500e+03 4.610560e+00 2000.00000 2003.750000 2.007500e+03 2.011250e+03 2.015000e+03
Life expectancy 2928.0 6.922493e+01 9.523867e+00 36.30000 63.100000 7.210000e+01 7.570000e+01 8.900000e+01
Adult Mortality 2928.0 1.647964e+02 1.242921e+02 1.00000 74.000000 1.440000e+02 2.280000e+02 7.230000e+02
infant deaths 2928.0 3.040745e+01 1.181144e+02 0.00000 0.000000 3.000000e+00 2.200000e+01 1.800000e+03
Alcohol 2928.0 4.559167e+00 3.920534e+00 0.01000 1.107500 3.770000e+00 7.400000e+00 1.787000e+01
percentage expenditure 2928.0 7.403212e+02 1.990931e+03 0.00000 4.853964 6.561145e+01 4.426143e+02 1.947991e+04
Hepatitis B 2928.0 8.096084e+01 2.253136e+01 1.00000 80.960842 8.700000e+01 9.600000e+01 9.900000e+01
Measles 2928.0 2.427856e+03 1.148597e+04 0.00000 0.000000 1.700000e+01 3.622500e+02 2.121830e+05
BMI 2928.0 3.823539e+01 1.985018e+01 1.00000 19.400000 4.300000e+01 5.610000e+01 7.760000e+01
under-five deaths 2928.0 4.217930e+01 1.607005e+02 0.00000 0.000000 4.000000e+00 2.800000e+01 2.500000e+03
Polio 2928.0 8.254830e+01 2.334055e+01 3.00000 78.000000 9.300000e+01 9.700000e+01 9.900000e+01
Total expenditure 2928.0 5.916257e+00 2.385963e+00 0.37000 4.370000 5.750000e+00 7.330000e+00 1.760000e+01
Diphtheria 2928.0 8.232142e+01 2.362958e+01 2.00000 78.000000 9.300000e+01 9.700000e+01 9.900000e+01
HIV/AIDS 2928.0 1.747712e+00 5.085542e+00 0.10000 0.100000 1.000000e-01 8.000000e-01 5.060000e+01
GDP 2928.0 6.627390e+03 1.331639e+04 1.68135 578.797095 1.764974e+03 4.793631e+03 1.191727e+05
Population 2928.0 1.276454e+07 5.390628e+07 34.00000 418120.500000 3.640009e+06 1.276454e+07 1.293859e+09
thinness 1-19 years 2928.0 4.850622e+00 4.396597e+00 0.10000 1.600000 3.400000e+00 7.100000e+00 2.770000e+01
thinness 5-9 years 2928.0 4.881423e+00 4.484890e+00 0.10000 1.600000 3.400000e+00 7.200000e+00 2.860000e+01
Income composition of resources 2928.0 6.301281e-01 2.054400e-01 0.00000 0.504000 6.770000e-01 7.730000e-01 9.480000e-01
Schooling 2928.0 1.201605e+01 3.254407e+00 0.00000 10.300000 1.230000e+01 1.410000e+01 2.070000e+01

Observation:

  • The Life Expectancy variable has a mean value of approximately 69.22 years, with a standard deviation of 9.52, indicating a moderate variability in life expectancy across different entries in the dataset.

  • The GDP variable exhibits a high standard deviation, suggesting a wide spread in the GDP values.

  • The Population variable also showcases a vast range, with a maximum value significantly larger than the 75th percentile, indicating the presence of outliers.

Now let’s look at the skewness of some of the key variables in our dataset:

GDP (Skewness: 3.2028)

The positive skewness indicates that the distribution of the GDP column is skewed to the right. Most countries have a GDP below the mean, with a few countries having a significantly higher GDP that pulls the mean upwards. Right-skewed distributions like this often contain a few extremely high values that can act as outliers, influencing statistical analyses, and may benefit from a transformation to normalize the data.

Income Composition of Resources (Skewness: -1.1421)

The negative skewness indicates a distribution skewed to the left, meaning most countries have a value greater than the mean; a majority of countries have a relatively high income composition of resources.

Schooling (Skewness: -0.5838)

Like the income composition of resources column, the schooling column exhibits a leftward skew, indicating that many countries have average years of schooling above the mean. Left-skewed distributions such as these suggest that most countries perform above the average, which might indicate general global progress in these areas.

Population (Skewness: 18.0108)

The population column has a very high positive skewness, indicating a substantial right skew: the majority of countries have small populations, while a few have extremely large populations, creating a long tail on the right side of the distribution. Such extreme skewness means a handful of countries have populations vastly larger than the rest, which could significantly influence the analyses depending on how the population variable is used. Hence, it may be beneficial to log-transform this variable to attain a more nearly normal distribution, facilitating better interpretation and visualization in analyses such as linear regression.
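The skewness values quoted above are what pandas' Series.skew() (the adjusted Fisher-Pearson coefficient) returns. A minimal, self-contained sketch on a synthetic right-skewed sample, not the actual dataset:

```python
import pandas as pd

# A small synthetic sample with one extreme value, mimicking a right skew
population_like = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 100])
print(round(population_like.skew(), 3))  # strongly positive: long right tail

# On the real dataframe, the equivalent call would be e.g.:
# df[["GDP", "Population", "Schooling", "Income composition of resources"]].skew()
```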

Further, I will visualize these distributions to get a more nuanced understanding of each variable’s distribution.

🧮 Data Transformation#

Before we proceed to data visualization, it is prudent to address the high skewness observed in the “Population” column. Applying a logarithmic transformation can help in reducing the skewness, bringing the data closer to a normal distribution which can facilitate better interpretation and visualization.

Let’s apply this transformation and visualize the outcome through a histogram to verify the efficacy of this strategy:

import numpy as np

# Applying logarithmic transformation to reduce the impact of extreme values
df['Population_log'] = np.log1p(df['Population'])

print('Skewness of the original "Population" column:', df['Population'].skew())
print('Skewness of the "Population_log" column:', df['Population_log'].skew())
Skewness of the original "Population" column: 18.010899275129972
Skewness of the "Population_log" column: -0.9298215014918261
# np.log(x + 1) is equivalent to the np.log1p transformation applied above
df['Population_log'] = np.log(df['Population'] + 1)

# Before Transformation
before_transform = alt.Chart(df).mark_bar(color='steelblue').encode(
    alt.X('Population', bin=alt.Bin(maxbins=30)),
    y='count()'
).properties(
    title='Before Log Transformation',
    width=300,
    height=300
)

# After Transformation
after_transform = alt.Chart(df).mark_bar(color='#2E8B57').encode(
    alt.X('Population_log', bin=alt.Bin(maxbins=30)),
    y='count()'
).properties(
    title='After Log Transformation',
    width=300,
    height=300
)

before_transform | after_transform

🎨 Data Visualization#

In the realm of Exploratory Data Analysis (EDA), data visualization stands as a central pillar, aiding in the intuitive and efficient representation of data insights through graphical means. It is a powerful tool that facilitates the identification of underlying patterns and trends that are often not discernible in tabulated data. Moreover, it enables the detection of outliers and anomalies, enhancing the data cleaning process. Through crafting visual narratives, we can convey complex data stories compellingly and accessibly, setting a robust foundation for informed decision-making as we delve deeper into our data analysis journey.

Understanding the Correlation Matrix

In data analysis and statistics, a correlation matrix is a valuable tool that helps us understand the relationships between variables in a dataset. It provides insights into how changes in one variable might be associated with changes in another. The correlation matrix is a square matrix where each cell represents the correlation coefficient between two variables.

The correlation coefficient quantifies the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where:

  • A positive correlation (close to 1) indicates that as one variable increases, the other tends to increase as well.

  • A negative correlation (close to -1) indicates that as one variable increases, the other tends to decrease.

  • A correlation close to 0 suggests little to no linear relationship between the variables.
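As a quick sanity check on what these coefficients mean, a minimal sketch comparing pandas' built-in Pearson correlation with the by-hand formula, on toy data rather than the WHO dataset:

```python
import pandas as pd

# Toy pair of series with a near-perfect positive linear relationship
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.1, 5.9, 8.2, 9.8])

# pandas' built-in Pearson correlation
r_pandas = x.corr(y)

# The same coefficient by hand: covariance over the product of the spreads
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / (
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
) ** 0.5

print(round(r_pandas, 4), round(r_manual, 4))  # both close to 1
```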

Use of the Correlation Matrix in My Final Project

In this project, I employed the correlation matrix as a fundamental analytical tool for several reasons:

  1. Identifying Relationships: The correlation matrix allowed me to identify and quantify relationships between different variables in my dataset. This was crucial for understanding which variables might be influencing each other.

  2. Feature Selection: By examining the correlation values, I could identify pairs of variables that were highly correlated. This information was valuable for feature selection, as highly correlated variables might not provide much additional information and could be candidates for elimination to reduce model complexity.

  3. Data Exploration: The correlation matrix provided a visual representation of the data’s internal structure. It helped me spot patterns and dependencies that might not have been apparent through individual variable analysis.

  4. Hypothesis Testing: I used the correlation coefficients to test hypotheses about the relationships between variables. For example, I might have hypothesized that variable A and variable B were positively correlated, and the correlation matrix allowed me to confirm or reject this hypothesis.

Overall, the correlation matrix was an essential tool in my data analysis toolkit, aiding in data exploration, feature selection, and hypothesis testing. It helped me gain a deeper understanding of the relationships within my dataset and make informed decisions in my final project.

# numeric_only=True restricts the correlation to numeric columns (required in pandas >= 2.0)
correlation_matrix = df.corr(numeric_only=True)
correlation_matrix
Year Life expectancy Adult Mortality infant deaths Alcohol percentage expenditure Hepatitis B Measles BMI under-five deaths ... Total expenditure Diphtheria HIV/AIDS GDP Population thinness 1-19 years thinness 5-9 years Income composition of resources Schooling Population_log
Year 1.000000 0.170033 -0.079052 -0.036464 -0.065644 0.032723 0.090739 -0.081840 0.104094 -0.041980 ... 0.071395 0.134333 -0.138789 0.093170 0.015180 -0.044835 -0.047888 0.235866 0.207311 0.043495
Life expectancy 0.170033 1.000000 -0.696359 -0.196557 0.390674 0.381864 0.204566 -0.157586 0.562453 -0.222529 ... 0.209588 0.476442 -0.556556 0.430991 -0.019629 -0.472778 -0.467231 0.688591 0.717314 -0.050995
Adult Mortality -0.079052 -0.696359 1.000000 0.078756 -0.191066 -0.242860 -0.139146 0.031176 -0.383641 0.094146 ... -0.112176 -0.273602 0.523821 -0.281715 -0.012509 0.300262 0.305767 -0.436268 -0.435926 0.058166
infant deaths -0.036464 -0.196557 0.078756 1.000000 -0.113919 -0.085906 -0.179724 0.501038 -0.227427 0.996628 ... -0.126471 -0.175524 0.024955 -0.103175 0.548547 0.465590 0.471229 -0.141329 -0.192421 0.186554
Alcohol -0.065644 0.390674 -0.191066 -0.113919 1.000000 0.344228 0.073398 -0.050490 0.322657 -0.110801 ... 0.302242 0.214494 -0.047314 0.312735 -0.030360 -0.419114 -0.408054 0.420009 0.499675 -0.011618
percentage expenditure 0.032723 0.381864 -0.242860 -0.085906 0.344228 1.000000 0.011988 -0.056831 0.230976 -0.088152 ... 0.177355 0.143967 -0.098230 0.901803 -0.024704 -0.252228 -0.253761 0.375234 0.387937 -0.025126
Hepatitis B 0.090739 0.204566 -0.139146 -0.179724 0.073398 0.011988 1.000000 -0.090827 0.139102 -0.185377 ... 0.060578 0.498359 -0.103061 0.059125 -0.110472 -0.106911 -0.110112 0.151082 0.165111 -0.009090
Measles -0.081840 -0.157586 0.031176 0.501038 -0.050490 -0.056831 -0.090827 1.000000 -0.176019 0.507718 ... -0.104294 -0.142154 0.030673 -0.069531 0.236236 0.224516 0.220774 -0.110884 -0.121817 0.075905
BMI 0.104094 0.562453 -0.383641 -0.227427 0.322657 0.230976 0.139102 -0.176019 1.000000 -0.237833 ... 0.227740 0.283995 -0.243575 0.278342 -0.063235 -0.530805 -0.537784 0.478476 0.517918 -0.025598
under-five deaths -0.041980 -0.222529 0.094146 0.996628 -0.110801 -0.088152 -0.185377 0.507718 -0.237833 1.000000 ... -0.128161 -0.196065 0.037783 -0.106446 0.535889 0.467620 0.472091 -0.159022 -0.207801 0.189267
Polio 0.094158 0.462592 -0.273295 -0.171049 0.213793 0.147608 0.406308 -0.136440 0.285168 -0.189120 ... 0.137069 0.672130 -0.159843 0.191831 -0.035148 -0.220920 -0.221702 0.353453 0.383582 -0.017235
Total expenditure 0.071395 0.209588 -0.112176 -0.126471 0.302242 0.177355 0.060578 -0.104294 0.227740 -0.128161 ... 1.000000 0.152581 0.000752 0.113703 -0.066584 -0.267536 -0.274105 0.154618 0.234150 -0.093552
Diphtheria 0.134333 0.476442 -0.273602 -0.175524 0.214494 0.143967 0.498359 -0.142154 0.283995 -0.196065 ... 0.152581 1.000000 -0.165135 0.183847 -0.025699 -0.228790 -0.222060 0.367749 0.386978 -0.013591
HIV/AIDS -0.138789 -0.556556 0.523821 0.024955 -0.047314 -0.098230 -0.103061 0.030673 -0.243575 0.037783 ... 0.000752 -0.165135 1.000000 -0.123029 -0.027386 0.203416 0.206637 -0.247559 -0.220577 -0.025369
GDP 0.093170 0.430991 -0.281715 -0.103175 0.312735 0.901803 0.059125 -0.069531 0.278342 -0.106446 ... 0.113703 0.183847 -0.123029 1.000000 -0.025054 -0.266365 -0.269977 0.436574 0.434110 0.008384
Population 0.015180 -0.019629 -0.012509 0.548547 -0.030360 -0.024704 -0.110472 0.236236 -0.063235 0.535889 ... -0.066584 -0.025699 -0.027386 -0.025054 1.000000 0.236239 0.234055 -0.007874 -0.029843 0.314948
thinness 1-19 years -0.044835 -0.472778 0.300262 0.465590 -0.419114 -0.252228 -0.106911 0.224516 -0.530805 0.467620 ... -0.267536 -0.228790 0.203416 -0.266365 0.236239 1.000000 0.938953 -0.407905 -0.452171 0.076299
thinness 5-9 years -0.047888 -0.467231 0.305767 0.471229 -0.408054 -0.253761 -0.110112 0.220774 -0.537784 0.472091 ... -0.274105 -0.222060 0.206637 -0.269977 0.234055 0.938953 1.000000 -0.397398 -0.441876 0.075623
Income composition of resources 0.235866 0.688591 -0.436268 -0.141329 0.420009 0.375234 0.151082 -0.110884 0.478476 -0.159022 ... 0.154618 0.367749 -0.247559 0.436574 -0.007874 -0.407905 -0.397398 1.000000 0.799817 0.016052
Schooling 0.207311 0.717314 -0.435926 -0.192421 0.499675 0.387937 0.165111 -0.121817 0.517918 -0.207801 ... 0.234150 0.386978 -0.220577 0.434110 -0.029843 -0.452171 -0.441876 0.799817 1.000000 -0.030052
Population_log 0.043495 -0.050995 0.058166 0.186554 -0.011618 -0.025126 -0.009090 0.075905 -0.025598 0.189267 ... -0.093552 -0.013591 -0.025369 0.008384 0.314948 0.076299 0.075623 0.016052 -0.030052 1.000000

21 rows × 21 columns

Note: I could also build this heatmap in Altair. However, for a static correlation heatmap, Seaborn's heatmap function offers a more direct, single-call approach, while Altair's main advantage, interactivity, adds little for this particular visualization. Hence, I use Seaborn here, which allows for convenient visualization of the correlation structure in the data.

plt.figure(figsize=(16,12))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
[Figure: annotated correlation heatmap of the numeric variables]

Univariate Distribution Analysis

To better understand the individual characteristics of each variable in our dataset, I have visualized the distribution and spread of each variable using histograms and box plots.

Histograms: These plots show the frequency of different ranges of values that each variable takes, helping us understand each variable's distribution. They also allow us to identify any skewness in the data and to gauge the central tendency of each variable.

Box plots: On the other hand, box plots allow us to visualize the central tendency and spread of the data while also identifying any potential outliers in each variable. The ‘box’ in the plot represents the interquartile range (IQR) where the bulk of the data values lie, and the ‘whiskers’ extend to the smallest and largest values within 1.5 times the IQR from the first and third quartiles, respectively.
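The whisker rule described above can be made explicit in code. A minimal sketch on a hypothetical toy sample (not the real data), computing the 1.5 × IQR fences that box plots use to flag outliers:

```python
import pandas as pd

# Hypothetical toy sample of life-expectancy-like values; 95 is a suspect outlier
s = pd.Series([60, 62, 65, 67, 70, 72, 74, 75, 76, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1                    # interquartile range: the 'box'
lower_fence = q1 - 1.5 * iqr     # whiskers extend to the last data points
upper_fence = q3 + 1.5 * iqr     # falling inside these fences

outliers = s[(s < lower_fence) | (s > upper_fence)]
print(outliers.tolist())
```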

columns_to_plot = [
    'Life expectancy', 'Adult Mortality', 'infant deaths', 'Alcohol', 
    'percentage expenditure', 'Hepatitis B', 'Measles', 'BMI', 
    'under-five deaths', 'Polio', 'Total expenditure', 'Diphtheria', 
    'HIV/AIDS', 'GDP', 'Population', 'thinness  1-19 years', 
    'thinness 5-9 years', 'Income composition of resources', 'Schooling'
]

for col in columns_to_plot:
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    sns.histplot(df, x=col, hue="Status", element="step", stat="density", common_norm=False, kde=True, ax=axes[0])
    axes[0].set_title(f'Distribution of {col}')
    axes[0].set_xlabel(col)
    axes[0].set_ylabel('Density')
    
    sns.boxplot(x=df[col], y=df["Status"], ax=axes[1])
    axes[1].set_title(f'Box plot of {col}')
    axes[1].set_xlabel(col)
    
    plt.tight_layout()
    plt.show()
[Figures: paired histogram (with KDE, by status) and box plot for each of the variables listed above]

Observations on Distributions and Outliers

From the histograms and box plots presented above, I can infer the distributions of various features in the dataset. The KDE lines in the histograms provide a smooth curve representing the distribution trend of the data, helping us visualize the underlying distribution more clearly.

We can identify the presence of outliers in several features through the box plots. These outliers represent values significantly different from the rest of the data. Despite noticing these outliers, I have chosen to retain them in our analysis to stay true to the original dataset and for a more realistic representation in this learning exercise.

Pair Plot Analysis

A pair plot allows us to see both the distribution of single variables and the relationships between two variables. Here, we are examining the relationships between some key factors — Life Expectancy, Adult Mortality, Alcohol, BMI, GDP, and Schooling — to better understand how they interrelate.

sns.pairplot(df[['Life expectancy', 'Adult Mortality', 'Alcohol', 'BMI', 'GDP', 'Schooling', 'Status']], hue='Status')
plt.show()
[Figure: pair plot of life expectancy, adult mortality, alcohol, BMI, GDP, and schooling, colored by development status]

Observation: Life Expectancy and Adult Mortality: There appears to be a negative correlation between life expectancy and adult mortality. As adult mortality increases, life expectancy seems to decrease, which is a logical relationship.

Life Expectancy and Alcohol: It is somewhat difficult to discern a clear trend between life expectancy and alcohol consumption from the scatter plot. Further statistical analysis might provide clearer insights.

Life Expectancy and BMI: We might observe a certain degree of positive correlation here, indicating that higher BMI values are associated with greater life expectancy, up to a point. It is important to note that extremely high BMI might have adverse effects on health.

Life Expectancy and GDP: There appears to be a positive correlation between GDP and life expectancy, indicating that higher GDP might be associated with a longer life expectancy, possibly due to better healthcare and living conditions in wealthier countries.

Life Expectancy and Schooling: There seems to be a positive correlation here as well, showing that higher levels of schooling might be associated with a longer life expectancy, potentially due to a variety of factors including better knowledge regarding health and access to better healthcare services.

💹 Regression Analyses#

Regression analyses allow us to understand the relationships between variables in our dataset. By creating models that detail the interactions between different factors, we can gain insights into how various aspects influence life expectancy. In this section, I will delve into linear and logistic regression analyses to further analyze and interpret the dataset. Let’s proceed by examining linear regression first.

  • Linear Regression: This analysis will help us explore how various health and economic factors linearly affect life expectancy. We aim to develop a model that can predict life expectancy based on these factors.

  • Logistic Regression: This analysis, on the other hand, will aid in predicting the categorical outcome (developed or developing status) of a country based on different indicators. This model will classify countries into ‘Developed’ or ‘Developing’ based on a series of variables.

By performing these analyses, I aim to derive meaningful insights and understand the underlying patterns and trends in the data. Let’s proceed to delve deeper and uncover these relationships.

📈 Linear Regression:#

Linear regression helps in understanding the linear relationship between the dependent and independent variables. In this subsection, we will identify which factors significantly influence life expectancy and to what extent. Before proceeding with the linear regression analysis, I first examine the correlations between the dependent variable (Life expectancy) and various independent variables available in our dataset. By doing this, I aim to identify the variables that have a strong linear relationship with life expectancy. This initial step helps in selecting appropriate variables for building our regression model, ensuring that I include only the most influential factors, thus improving the predictive power and interpretability of our model.

# numeric_only=True restricts the correlation to numeric columns (required in pandas >= 2.0)
correlations = df.corr(numeric_only=True)
correlations['Life expectancy'].sort_values(ascending=False)
Life expectancy                    1.000000
Schooling                          0.717314
Income composition of resources    0.688591
BMI                                0.562453
Diphtheria                         0.476442
Polio                              0.462592
GDP                                0.430991
Alcohol                            0.390674
percentage expenditure             0.381864
Total expenditure                  0.209588
Hepatitis B                        0.204566
Year                               0.170033
Population                        -0.019629
Population_log                    -0.050995
Measles                           -0.157586
infant deaths                     -0.196557
under-five deaths                 -0.222529
thinness 5-9 years                -0.467231
thinness  1-19 years              -0.472778
HIV/AIDS                          -0.556556
Adult Mortality                   -0.696359
Name: Life expectancy, dtype: float64

The two highest correlations with “Life Expectancy” are “Schooling” (0.717) and “Income Composition of Resources” (0.689). This indicates a strong positive linear relationship between these variables and life expectancy, thus justifying our choice to further analyze them through linear regression in the subsequent charts.

Notable: The variable “Alcohol” has a correlation coefficient of 0.390674 with life expectancy. However, it is important to interpret this with caution. This positive correlation does not necessarily mean that increased alcohol consumption directly leads to a longer life expectancy. There might be underlying factors influencing this relationship, such as economic factors, healthcare quality, or lifestyle differences that tend to accompany higher alcohol consumption in the dataset being considered. Moreover, high alcohol consumption is generally associated with numerous health risks. It could be that in the contexts represented in this data, areas with higher alcohol consumption also have other factors positively influencing life expectancy, giving rise to this positive correlation.
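One way to probe the confounding suspicion above is to compare the pooled correlation with correlations computed within each development status. A hedged sketch on synthetic data (not the WHO dataset) showing how a pooled positive correlation can mask within-group relationships of the opposite sign:

```python
import pandas as pd

# Synthetic illustration: two groups with different baseline levels
pooled = pd.DataFrame({
    "Status":  ["Developing"] * 4 + ["Developed"] * 4,
    "Alcohol": [1, 2, 2, 3, 8, 9, 10, 11],
    "Life":    [60, 58, 61, 59, 79, 80, 78, 79],
})

# Pooled: strongly positive, driven by the gap between the two groups
print("pooled r:", round(pooled["Alcohol"].corr(pooled["Life"]), 3))

# Within each group, the relationship is actually negative
for status, grp in pooled.groupby("Status"):
    print(status, "r:", round(grp["Alcohol"].corr(grp["Life"]), 3))
```

This is the classic Simpson's-paradox pattern: stratifying by a grouping variable (here, development status) can reverse the apparent direction of an association.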

from sklearn.linear_model import LinearRegression

regg = LinearRegression()

regg.fit(df[["Schooling"]], df["Life expectancy"])

print("Coefficient/Slope of line of best fit: ", regg.coef_[0])
print("Y-Intercept of line of best fit: ", regg.intercept_)

df["pred"] = regg.predict(df[["Schooling"]])

color_scale = alt.Scale(domain=['Developed', 'Developing'], range=['#00FFA5', '#FF66B2'])

c7 = alt.Chart(df).mark_line(color="yellow").encode(
    x=alt.X("Schooling:Q", title="Schooling (years)", scale=alt.Scale(zero=False, nice=True)),
    y=alt.Y("pred:Q", title="Predicted Life Expectancy (years)", scale=alt.Scale(zero=False, nice=True))
)

c4 = alt.Chart(df).mark_circle().encode(
    x=alt.X("Schooling:Q", title="Schooling (years)", scale=alt.Scale(zero=False, nice=True)),
    y=alt.Y("Life expectancy:Q", title="Life Expectancy (years)", scale=alt.Scale(zero=False, nice=True)),
    color=alt.Color("Status:N", scale=color_scale, legend=alt.Legend(title="Development Status", titleColor='white', labelColor='white', titleFontSize=12, labelFontSize=10)),
    tooltip=["Schooling", "Income composition of resources", "Adult Mortality", "BMI", "GDP", "Status"]
)

(c4 + c7).properties(
    title={
        "text": 'Linear Regression Model',
        "color": "white",
        "fontSize": 20
    },
    width=600,
    height=400,
    background='#2A2A2A'
).configure_axis(
    grid=True,
    labelColor='white',
    titleColor='white',
    domainColor='white',
    tickColor='white'
).configure_view(
    stroke=None,
).configure_title(
    fontSize=24,
    anchor='start',
    color='white'
)
Coefficient/Slope of line of best fit:  2.0991846069651716
Y-Intercept of line of best fit:  44.001020482631276
regg = LinearRegression()

regg.fit(df[["Schooling"]], df["Life expectancy"])

print("Coefficient/Slope of line of best fit: ", regg.coef_[0])
print("Y-Intercept of line of best fit: ", regg.intercept_)

df["pred"] = regg.predict(df[["Schooling"]])

df['Schooling Range'] = pd.cut(df['Schooling'], bins=[0, 5, 10, 15, 20, 25], labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

c7 = alt.Chart(df).mark_line(color="magenta").encode(
    x=alt.X("Schooling:Q", title="Schooling (years)", scale=alt.Scale(zero=False, nice=True)),
    y=alt.Y("pred:Q", title="Predicted Life Expectancy (years)", scale=alt.Scale(zero=False, nice=True))
)

c4 = alt.Chart(df).mark_circle().encode(
    x=alt.X("Schooling:Q", title="Schooling (years)", scale=alt.Scale(zero=False, nice=True)),
    y=alt.Y("Life expectancy:Q", title="Life Expectancy (years)", scale=alt.Scale(zero=False, nice=True)),
    color='Schooling Range:N',
    tooltip = ["Schooling", "Income composition of resources", "Adult Mortality", "BMI", "GDP", 'Schooling Range']
)

(c4 + c7).properties(
    title={
        "text": 'Linear Regression Model',
        "fontSize": 20
    },
    width=600,
    height=400,
    background='#F0F0F0'
).configure_axis(
    grid=True
).configure_view(
    stroke=None,
).configure_title(
    fontSize=24,
    anchor='start',
)
Coefficient/Slope of line of best fit:  2.0991846069651716
Y-Intercept of line of best fit:  44.001020482631276

Observation: From the linear regression model, I observe that the coefficient (slope) for the variable ‘Schooling’ is approximately 2.10. This means each additional year of schooling is associated with an increase in life expectancy of about 2.10 years. Note that this is a simple one-variable regression, so it does not actually control for other factors.

The y-intercept of approximately 44.00 is the model’s estimated life expectancy at zero years of schooling. Since few observations lie near zero schooling, this value is an extrapolation rather than a reliable prediction.

This visualization vividly illustrates the positive relationship between schooling and life expectancy: as the number of schooling years increases, so does life expectancy. Furthermore, by binning the data points into schooling ranges (from ‘Very Low’ to ‘Very High’), I can discern that the higher schooling bands cluster at progressively higher life expectancies.
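As a quick sanity check on this interpretation, the slope and intercept printed above can be plugged into the line equation directly (a small illustrative snippet; the helper function name is my own):

```python
# Slope and intercept reported by the fitted model above.
slope = 2.0991846069651716
intercept = 44.001020482631276

def predict_life_expectancy(schooling_years):
    """Predicted life expectancy (years) for a given number of schooling years."""
    return intercept + slope * schooling_years

# Twelve years of schooling predicts roughly 69.2 years of life expectancy.
print(round(predict_life_expectancy(12), 2))
```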

regg = LinearRegression()

regg.fit(df[["Income composition of resources"]], df["Life expectancy"])

print("Coefficient/Slope of line of best fit: ", regg.coef_[0])
print("Y-Intercept of line of best fit: ", regg.intercept_)

df["pred"] = regg.predict(df[["Income composition of resources"]])

color_scale = alt.Scale(domain=['Developed', 'Developing'], range=['#00FFA5', '#FF66B2'])

c7 = alt.Chart(df).mark_line(color="yellow").encode(
    x=alt.X("Income composition of resources:Q", title="Income Composition of Resources", scale=alt.Scale(zero=False, nice=True)),
    y=alt.Y("pred:Q", title="Predicted Life Expectancy (years)", scale=alt.Scale(zero=False, nice=True))
)

c4 = alt.Chart(df).mark_circle().encode(
    x=alt.X("Income composition of resources:Q", title="Income Composition of Resources", scale=alt.Scale(zero=False, nice=True)),
    y=alt.Y("Life expectancy:Q", title="Life Expectancy (years)", scale=alt.Scale(zero=False, nice=True)),
    color=alt.Color("Status:N", scale=color_scale, legend=alt.Legend(title="Development Status", titleColor='white', labelColor='white', titleFontSize=12, labelFontSize=10)),
    tooltip=["Schooling", "Income composition of resources", "Adult Mortality", "BMI", "GDP", "Status"]
)

(c4 + c7).properties(
    title={
        "text": 'Linear Regression Model',
        "color": "white",
        "fontSize": 20
    },
    width=600,
    height=400,
    background='#2A2A2A'
).configure_axis(
    grid=True,
    labelColor='white',
    titleColor='white',
    domainColor='white',
    tickColor='white'
).configure_view(
    stroke=None,
).configure_title(
    fontSize=24,
    anchor='start',
    color='white'
)
Coefficient/Slope of line of best fit:  31.921984990976043
Y-Intercept of line of best fit:  49.109992780694796
regg = LinearRegression()

regg.fit(df[["Income composition of resources"]], df["Life expectancy"])

print("Coefficient/Slope of line of best fit: ", regg.coef_[0])
print("Y-Intercept of line of best fit: ", regg.intercept_)

df["pred"] = regg.predict(df[["Income composition of resources"]])

df['Income composition of resources Range'] = pd.cut(df['Income composition of resources'], bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0], labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

c7 = alt.Chart(df).mark_line(color="magenta").encode(
    x=alt.X("Income composition of resources:Q", title="Income Composition of Resources", scale=alt.Scale(zero=False, nice=True)),
    y=alt.Y("pred:Q", title="Predicted Life Expectancy (years)", scale=alt.Scale(zero=False, nice=True))
)

c4 = alt.Chart(df).mark_circle().encode(
    x=alt.X("Income composition of resources:Q", title="Income Composition of Resources", scale=alt.Scale(zero=False, nice=True)),
    y=alt.Y("Life expectancy:Q", title="Life Expectancy (years)", scale=alt.Scale(zero=False, nice=True)),
    color='Income composition of resources Range:N',
    tooltip = ["Income composition of resources", "Adult Mortality", "BMI", "GDP", 'Income composition of resources Range']
)

(c4 + c7).properties(
    title={
        "text": 'Linear Regression Model',
        "fontSize": 20
    },
    width=600,
    height=400,
    background='#F0F0F0'
).configure_axis(
    grid=True
).configure_view(
    stroke=None,
).configure_title(
    fontSize=24,
    anchor='start',
)
Coefficient/Slope of line of best fit:  31.921984990976043
Y-Intercept of line of best fit:  49.109992780694796

Observation: The coefficient/slope (31.9220) gives the change in the dependent variable (life expectancy) per one-unit change in the independent variable (income composition of resources). Since the index runs from 0 to 1, a more practical reading is that every 0.1 increase in the income composition of resources is associated with roughly 3.19 additional years of life expectancy. This suggests a strong positive relationship between the two variables.

The y-intercept (49.1100) represents the estimated life expectancy when the income composition of resources is zero: even at the bottom of the index, the model predicts a life expectancy of about 49.11 years, reflecting baseline factors not captured by this single variable.

Overall, the regression reveals a clear positive association between life expectancy and the income composition of resources, with a steep slope. It would be worth investigating this relationship further and considering other variables that might influence life expectancy to enhance the predictive accuracy of the model.
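The interpretation above can be verified arithmetically from the slope and intercept printed by the model (values copied from the output above):

```python
# Slope and intercept reported by the fitted model above.
slope = 31.921984990976043
intercept = 49.109992780694796

# A 0.1 increase in the index corresponds to roughly 3.19 extra years:
print(round(slope * 0.1, 3))

# Predicted life expectancy at the extremes of the index (0 and 1):
print(round(intercept, 2), round(intercept + slope, 2))
```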

➗ Train-Test Split#

Understanding the train-test split: in machine learning, we often split the dataset into two subsets, a training set and a testing set. Here’s why we use it:

Avoiding Overfitting: By setting aside a portion of the data (testing set) and not using it in the model training process, we create a mechanism to check if our model is overfitting. Overfitting occurs when the model learns patterns that are specific to the training data and does not generalize well to new, unseen data. By evaluating the model on unseen data, we can better understand how our model will perform in real-world scenarios.

Assessing Generalization: The main goal of a machine learning model is to make accurate predictions on new, unseen data. By using a train-test split, we can train our model on one subset of the data and then test its performance on a different subset that it hasn’t seen before. This gives us a more reliable estimate of the model’s performance in the real world.

In this analysis, using a train-test split helps ensure that my linear regression model learns the general patterns in the data without overfitting to noise or specific patterns in the training set, thereby ensuring the reliability and stability of our predictions.

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X = df[["Income composition of resources"]]
y = df["Life expectancy"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regg = LinearRegression()
regg.fit(X_train, y_train)

y_pred = regg.predict(X_test)

print("Coefficient/Slope of line of best fit: ", regg.coef_[0])
print("Y-Intercept of line of best fit: ", regg.intercept_)

print("Mean Squared Error (MSE): ", mean_squared_error(y_test, y_pred))
Coefficient/Slope of line of best fit:  32.320200613399244
Y-Intercept of line of best fit:  48.79572350364165
Mean Squared Error (MSE):  43.54972876299568

Observation:

These results indicate the performance of our linear regression model when applied to unseen data:

Coefficient/Slope of the line of best fit (32.3202): for every unit increase in the “Income composition of resources,” the predicted “Life expectancy” increases by approximately 32.32 years. Since the index spans 0 to 1, this is equivalent to about 3.23 years per 0.1 increase.

Y-Intercept of the line of best fit (48.7957): this is the estimated “Life expectancy” when the “Income composition of resources” is zero.

Mean Squared Error (MSE) (43.5497): this value is the average of the squared differences between the true and predicted values; the closer it is to zero, the better the model has performed. An MSE of about 43.55 means predictions are typically off by several years, so the single-variable model leaves a fair amount of error.
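Because MSE is in squared units (years²), taking its square root expresses the typical prediction error back in years. Using the MSE printed above:

```python
import math

mse = 43.54972876299568  # MSE value printed above
rmse = math.sqrt(mse)    # root-mean-squared error, in years
print(round(rmse, 2))    # about 6.6 years of typical error
```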

For comparison, the coefficient and intercept obtained earlier from fitting on the full dataset were:

Coefficient/Slope of the line of best fit: 31.922
Y-Intercept of the line of best fit: 49.110

and those obtained from fitting on the training set alone were:

Coefficient/Slope of the line of best fit: 32.320
Y-Intercept of the line of best fit: 48.796

The slight difference arises because the second model was fit on only the 80% training subset rather than the full dataset; a different random split (a different random_state) would shift these values slightly.

Conclusion: These two sets of values are quite close to each other, indicating that the model is stable and shows no sign of overfitting. Still, the substantial MSE suggests there is room for improvement, possibly by incorporating additional variables into the model to better capture the factors influencing life expectancy.
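A natural next step is a multiple regression combining both predictors examined above. The sketch below uses a tiny synthetic DataFrame (with the same column names as the WHO dataset) so it runs on its own; in the notebook, the cleaned df would be used instead:

```python
# Sketch: extending the model to two predictors (income composition plus
# schooling). The DataFrame here is synthetic stand-in data, not the WHO data.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.DataFrame({
    "Income composition of resources": [0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 0.4, 0.55],
    "Schooling":                        [6, 10, 11, 13, 15, 17, 8, 12],
    "Life expectancy":                  [55, 65, 68, 72, 76, 80, 60, 70],
})

X = df[["Income composition of resources", "Schooling"]]
y = df["Life expectancy"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

multi = LinearRegression().fit(X_train, y_train)
test_mse = mean_squared_error(y_test, multi.predict(X_test))
print("Coefficients:", multi.coef_)  # one coefficient per predictor
print("Test MSE:", test_mse)
```

Comparing this model's test MSE against the single-variable MSE above would show whether the second predictor adds real explanatory power.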

📝 Summary#

In this project, I delved deep into a dataset provided by the WHO, which contained a considerable number of missing values, predominantly from countries with smaller populations where data collection is difficult. Despite this, I carried out a detailed analysis of the factors influencing life expectancy, identifying schooling and income composition of resources as pivotal determinants through linear regression. I validated these models with a train-test split to ensure their reliability and to guard against overfitting, a process that upheld the significance of my chosen variables. The visualizations facilitated a deeper understanding of the relationships between the variables and highlighted the nations standing at both extremes of the life expectancy spectrum.

📚 References#

  • What is the source of your dataset(s)?

https://www.kaggle.com/kumarajarshi/life-expectancy-who

  • List any other references that you found helpful.

https://scikit-learn.org/stable/modules/impute.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html

https://medium.com/search?q=python+linear+regression

https://datagy.io/mean-squared-error-python/

Created in Deepnote